Data selection and model combination in connectionist speech recognition
Abstract
The hybrid connectionist-hidden Markov model (HMM) approach to large vocabulary continuous speech recognition has been shown to be competitive with HMM-based systems. However, the recent availability of extremely large amounts of acoustic training data has highlighted a problem with the connectionist acoustic modelling paradigm. The effective use of such large amounts of data is difficult due to the computational requirements of training large connectionist models. This dissertation details research aimed at increasing the performance of connectionist acoustic models through the effective use of available training data. The methods investigated are based on ensembles of models. An ensemble is a collection of models which are combined in a manner such that the performance of the ensemble is greater than that of any of the models which form the ensemble. Most ensemble methods use a simple linear combination of the model estimates to form the ensemble estimate. A data-dependent ensemble technique has been developed in which the combination of the ensemble models is dependent on the current input. The use of ensembles for speaker adaptation has been investigated, and a method based on clustering of training data has been developed and implemented. This speaker adaptation scheme does not require additional adaptation data, and can reduce the error rate of a hybrid connectionist-HMM speaker-independent recognition system by up to 14.5%. In addition, clustering allows effective use of large amounts of training data. Boosting is a method which makes selective use of training data, and produces an ensemble with each model trained on data drawn from a different distribution. Results on the optical character recognition task suggest that boosting can provide considerable gains in classification performance. The application of boosting to acoustic modelling has been investigated, and a modified boosting procedure has been developed.
The boosting algorithms have been applied to multilayer perceptron acoustic models, and the performance of the models has been assessed on a number of ARPA benchmark tasks. The results show that boosting consistently provides a 14–19% reduction in word error rate. The standard boosting techniques are not suitable for use with recurrent network acoustic models, and three new boosting algorithms have been developed for use with connectionist models with internal memory. These new boosting algorithms have also been evaluated on a number of ARPA benchmark tasks, and have been shown to lead to a reduction in word error rate of 10–18%.
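As an illustration of the two combination styles the abstract contrasts, the following is a minimal sketch, not the dissertation's implementation. It assumes each model outputs a vector of class posterior estimates for the current input. The names `combine_linear`, `combine_data_dependent`, and `gate` are hypothetical, and the gating function stands in for whatever input-dependent weighting scheme is actually used.

```python
from typing import Callable, List, Optional, Sequence


def combine_linear(
    estimates: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Linear combination of per-model posterior estimates.

    estimates[m][k] is model m's estimate for class k.
    With no weights given, every model gets equal weight (a plain average).
    """
    n_models = len(estimates)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(estimates[0])
    return [
        sum(w * est[k] for w, est in zip(weights, estimates))
        for k in range(n_classes)
    ]


def combine_data_dependent(
    estimates: Sequence[Sequence[float]],
    gate: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
) -> List[float]:
    """Combination whose weights depend on the current input x.

    `gate` is a hypothetical function mapping the input to one weight
    per model; it is an assumption made for this sketch.
    """
    return combine_linear(estimates, gate(x))
```

With three models whose estimates for a two-class problem are `[0.6, 0.4]`, `[0.2, 0.8]`, and `[0.7, 0.3]`, the equal-weight combination is approximately `[0.5, 0.5]`, while a gate that puts all of its weight on the first model for a given input returns that model's estimate unchanged.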
Similar resources
Improving of Feature Selection in Speech Emotion Recognition Based-on Hybrid Evolutionary Algorithms
One of the important issues in speech emotion recognition is selecting appropriate feature sets in order to improve the detection rate and classification accuracy. In previous studies, researchers tried to select appropriate features for classification by using feature selection and feature space reduction methods, such as Fisher and PCA. In this research, a hybrid evolutionary algorit...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
Ensemble methods for connectionist acoustic modelling
In this paper we investigate a number of ensemble methods for improving the performance of connectionist acoustic models for large vocabulary continuous speech recognition. We discuss boosting, a data selection technique which results in an ensemble of models, and mixtures-of-experts. These techniques have been applied to multi-layer perceptron acoustic models used to build a hybrid connecti...
Speech Emotion Recognition Using Scalogram Based Deep Structure
Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...
A Comparative Study of Gender and Age Classification in Speech Signals
Accurate gender classification is useful in speech and speaker recognition as well as speech emotion classification, because a better performance has been reported when separate acoustic models are employed for males and females. Gender classification is also apparent in face recognition, video summarization, human-robot interaction, etc. Although gender classification is rather mature in a...
Publication date: 1997